Data Warehouse
Ressources
What is a Data Warehouse
A data warehouse is a centralized repository that stores structured, semi-structured data from multiple sources in a predefined schema, specifically designed for data analysis, BI, and reporting.
It contains historical data that has been cleaned, transformed, and organized according to a specific business model or requirement.
Data must be processed and transformed (ETL - Extract, Transform, Load) or ELT (Extract, Load, Transform) before being stored in the warehouse.
In general, data warehouses are optimized for Gigabytes to Petabytes of data.
Why use a Data Warehouse
- Optimized for Analysis: Data is structured and organized for optimal query performance and business intelligence.
- Data Consistency: Enforces data quality and provides a single source of truth for business reporting.
- Historical Analysis: Maintains historical data for trend analysis and business intelligence.
- Business Focus: Designed specifically for business users and analysts to access standardized data for reporting and decision-making.
Challenges of Data Warehouses
- Schema Rigidity: Predefined schema makes it difficult to adapt to new data types or changing business requirements.
- Cost: High setup and maintenance costs due to specialized hardware and software requirements.
- Time to Value: Implementing ETL processes and maintaining data models can be time-consuming.
- Limited Data Types: Primarily designed for structured data, making it difficult to handle unstructured or semi-structured data formats.
OLAP Systems
Data warehouses are built on OLAP (Online Analytical Processing) systems, which differ significantly from OLTP (Online Transaction Processing) systems used in operational databases.
Key Differences
Characteristic | OLTP | OLAP |
---|---|---|
Purpose | Daily transactions processing | Data analysis and reporting |
Data Model | Normalized | Denormalized |
Data Updates | Continuous updates | Periodic batch updates |
Query Type | Simple transactions, few records | Complex queries, many records |
Response Time | Milliseconds | Seconds to minutes |
Space Requirements | Gigabytes | Terabytes to Petabytes |
Data History | Current data (days/months) | Historical data (years) |
Optimization | For insert/update | For complex queries |